dws: crudely enforce mdt count constraints #181
Conversation
LGTM! Just a couple of minor clarification questions. Nice work!
count = alloc_set["constraints"]["count"]
server_alloc_set["allocationSize"] = math.ceil(
    alloc_set["minimumCapacity"] / count
)
probably just a dumb question, but could count ever be 0 in case the key-value lookup doesn't succeed for whatever reason? Could maybe be helpful to check before attempting this division, although Python doesn't do anything catastrophic on division by 0, it just raises a ZeroDivisionError :) I'll leave it up to you whether that check could be helpful
I think count could be 0, but then something in the k8s environment must have been misconfigured and would need tweaking by an admin. I think the ZeroDivisionError would be OK here. Good catch though.
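For reference, a guarded version might look roughly like this; just a sketch, where raising ValueError instead of letting the ZeroDivisionError propagate is only one possible choice, and the key names follow the quoted diff:

count = alloc_set["constraints"]["count"]
if count <= 0:
    # a zero (or negative) count means the directivebreakdown resource was
    # misconfigured; fail with an explicit message instead of a bare
    # ZeroDivisionError (assumes `import math` as in the surrounding code)
    raise ValueError(f"bad constraints.count: {count}")
server_alloc_set["allocationSize"] = math.ceil(
    alloc_set["minimumCapacity"] / count
)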
sounds great - just wanted to make sure. thanks!!
)
count -= 1
if count == 0:
    break
if you change the while condition to >= 1, maybe you don't need to add this extra if check at the end of every loop?
I don't see how >= 1 would be different than > 0? Also the if check is just to break out of the inner loop, which would otherwise proceed without regard to the value of count -- I wanted to make sure that exactly count allocations were made.
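To spell out the shape being discussed, here is only a sketch of why the inner break matters; the function name, variable names other than count, and return value are made up for illustration, not the PR's actual code:

def place_allocations(count, sorted_rabbits):
    """Place exactly `count` allocations, favoring the rabbits that come
    first in sorted_rabbits (assumed sorted by allocated compute nodes)."""
    placements = []
    while count > 0:
        for rabbit_name in sorted_rabbits:
            placements.append(rabbit_name)
            count -= 1
            if count == 0:
                # the for loop never looks at count, so without this break it
                # would keep placing allocations past the requested number
                break
    return placements

# e.g. place_allocations(3, ["rabbit1", "rabbit2"])
# -> ["rabbit1", "rabbit2", "rabbit1"]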
Gotcha - my mistake!! Thanks for the clarification
Problem: directivebreakdown.py has a function that isn't used anywhere. Remove it.
Problem: the logic for patching Servers resources with allocation info is long and complicated. Break a chunk of it out into the directivebreakdown module.
Problem: directivebreakdown.py has an erroneous return statement in a generator function, an unused variable, and is missing some docstrings. Clean up the module.
Problem: as described in issue flux-framework#171, creating many MDTs is bad for performance, and usually goes against what is explicitly required by directivebreakdown resources. However, there is not yet a good way to get Fluxion to handle MDT allocation. Bypass Fluxion allocation completely, and tell DWS to create exactly the number of allocations requested in the .constraints.count field (which is usually found on MDTs). Place the allocations on the rabbits which have the most compute nodes allocated to the job. This is intended to be only a temporary solution, since it adds a new potential problem, in that some rabbit storage is used which is not tracked by Fluxion. This could lead to overallocation of resources, causing jobs to fail with errors. However, this seems unlikely to occur in practice, since MDTs are small and Fluxion always gives jobs more storage than they asked for, so there should usually be some spare storage.
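A rough sketch of the placement rule described in this commit message: sort the rabbits by how many of the job's compute nodes they host, then hand the top ones to DWS for the MDT allocations. The function and variable names below are invented for illustration and are not the PR's code:

def pick_mdt_rabbits(compute_counts, mdt_count):
    """Pick the rabbits hosting the most of the job's compute nodes.

    compute_counts maps rabbit name -> number of compute nodes allocated
    to the job on that rabbit; mdt_count is the .constraints.count value.
    """
    ranked = sorted(compute_counts.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:mdt_count]]

# e.g. pick_mdt_rabbits({"rabbit1": 8, "rabbit2": 3, "rabbit3": 5}, 2)
# -> ["rabbit1", "rabbit3"]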
Force-pushed from 7330013 to e80fe39
Thanks for the review @cmoussa1 ! Setting MWP.